dirichlet noise added to prior probabilities during self play #186
base: master
Conversation
Thanks for sharing your code!
I made edits to my code to incorporate your remarks. Let me know if you believe it is behaving properly now.
I was able to get a Connect4 agent playing perfectly as far as I have explored. It wins 100% of games moving first in the evaluation loop, and it does not make any moves that would result in a loss or a draw when I play against it with the aid of https://connect4.gamesolver.org/. I used the updated Dirichlet algorithm I made with NMO13's input. Other changes to achieve this: My config was as follows: I don't know the exact number of iterations, because I had to restart the code due to memory leaks in my current implementation of self-play parallelization, but it was probably ~75.
I believe I spoke too soon. The second player has begun playing alternative moves, allowing the training to continue for new types of positions.
I added some comments to the code.
Sounds interesting, but I would suggest doing that in a different PR because otherwise it gets too messy. And please resolve the conflicts, thanks. ;-)
Thank you once again for your comments. I removed the bug (quite a large one) caused by calling self.dirichlet_noise instead of dirichlet_noise in the code I committed to my pull request. I also fixed this in the code for the model I am training now. I saw the results of your dirichlet_noise experiment and am glad the evidence matches our intuitions. How similar is your code for implementing Dirichlet noise? Also, let me know what you think about the i==0 and leaf-node situations.
Pretty similar. The main difference is that I don't override Ps, as already mentioned.
Notice the np.copy. And then I do:
That should ensure that the probabilities don't drift too far away.
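Something along these lines (a sketch of how that approach could look; the helper name noisy_priors, the eps value, and the exact signature are illustrative, not the actual code):

    import numpy as np

    # Mix fresh Dirichlet noise into a copy of the cached priors on every visit,
    # so the cached self.Ps[s] itself never drifts.
    def noisy_priors(self, s, valids, alpha=0.6, eps=0.25):
        probs = np.copy(self.Ps[s])                    # cached priors stay untouched
        valid_idx = np.nonzero(valids)[0]              # only valid moves receive noise
        noise = np.random.dirichlet([alpha] * len(valid_idx))
        probs[valid_idx] = (1 - eps) * probs[valid_idx] + eps * noise
        return probs / np.sum(probs)                   # keep a proper distribution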
I think I like my implementation of Dirichlet noise more, because I think it is more likely DeepMind implemented it like I did, based on my understanding of the paper. It seems better to me to search a path repeatedly in depth (if that path got a big increase in P from the Dirichlet noise). If you randomly assign Dirichlet noise for each simulation, I think the benefit will be more like adding uniform noise, because the variation of the noise averages out over repeated sampling. Do you agree or disagree?
Ok, so my intuition was to add, as you already mentioned, something more like uniform noise, just to make sure that the other paths are not "starving". I am just thinking that your solution of repeatedly overriding Ps could mess up the probabilities too much. But of course, I might be wrong and your solution works even better.
I am currently training your version. I am curious how it will perform against my version.
To minimize variables, you should use the Dirichlet alpha you used to train your other model. I got very far with 1.4, but it got stuck for ~15 iterations on all ties. I just dropped the alpha to 0.9, so hopefully this is enough variation for it to find the optimal solution. What alpha did you use when you trained?
I used 0.6. And for the first 100 iterations it finds a better model every ~4 iterations.
Comparing your version vs mine:
Comparing your version vs the original one:
Conclusion:
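Just for intuition on what these alpha values do (illustrative numbers, not from either training run): smaller alpha makes each Dirichlet sample spikier, larger alpha spreads it more evenly over the moves.

    import numpy as np

    np.random.seed(0)
    for alpha in (1.4, 0.9, 0.6):
        # One noise vector over 7 Connect4 columns per alpha value
        sample = np.random.dirichlet([alpha] * 7)
        print(alpha, np.round(sample, 2))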
Copy paste from suragnair/alpha-zero-general#186
if dirichlet_noise:
    self.applyDirNoise(s, valids)
    sum_Ps_s = np.sum(self.Ps[s])
    self.Ps[s] /= sum_Ps_s  # renormalize
applyDirNoise has already calculated a normalized sequence when it added two normalized sequences (scaled by 0.75 and 0.25, respectively), so there is no need to renormalize Ps[s] here – doing so has no effect.
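For reference, a sketch of what applyDirNoise presumably looks like under that description (the PR's actual body may differ; self.Ps[s] is assumed to already be normalized over the valid moves, with invalid entries equal to zero):

    import numpy as np

    def applyDirNoise(self, s, valids, alpha=0.6):
        valid_idx = np.nonzero(valids)[0]
        noise = np.random.dirichlet([alpha] * len(valid_idx))  # sums to 1
        # 0.75 * 1 + 0.25 * 1 = 1, so renormalizing self.Ps[s] afterwards
        # is a no-op (up to floating-point error).
        self.Ps[s][valid_idx] = 0.75 * self.Ps[s][valid_idx] + 0.25 * noise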
I don't completely understand the mathematics of the Dirichlet distribution, but I arbitrarily chose an alpha of 0.6 as the default value for dirichletAlpha in main.py because that seems approximately right for Othello. The one thing I am unsure about in my code is when to apply the noise to the prior probabilities the neural network generates for a board s. Currently I am applying it at every searched board node, because otherwise only the first move would get the Dirichlet noise, due to the caching of the prior probabilities and policies.
But the paper seems to imply it should only be applied at the root node:
Thoughts and feedback are appreciated.
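For what it's worth, here is one way root-only noise could be done despite the caching (a sketch only, not the code in this PR; the method and attribute names follow my reading of the repo's MCTS class and should be treated as assumptions):

    import numpy as np

    # Mix the noise into the cached root priors once per move in getActionProb,
    # and leave every deeper node's Ps untouched.
    def getActionProb(self, canonicalBoard, temp=1, dirichlet_noise=True, alpha=0.6):
        s = self.game.stringRepresentation(canonicalBoard)

        self.search(canonicalBoard)  # first simulation populates self.Ps[s]

        if dirichlet_noise:
            valids = self.game.getValidMoves(canonicalBoard, 1)
            valid_idx = np.nonzero(valids)[0]
            noise = np.random.dirichlet([alpha] * len(valid_idx))
            self.Ps[s][valid_idx] = 0.75 * self.Ps[s][valid_idx] + 0.25 * noise

        for _ in range(self.args.numMCTSSims - 1):
            self.search(canonicalBoard)

        # ...then build the visit-count policy from the stored counts as usual.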